Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges:

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB uncompressed

Kaggle API setup

Kaggle is a data science competition platform that hosts many datasets. In the past, submitting your results was troublesome: you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

The API is quite easy to set up; a full submission takes less than 15 minutes.

  1. Install the library

For more detailed information on setting up the Kaggle API, see here and here.
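As a sketch, a typical setup looks like the following (the token path and permissions step are the conventional ones; adjust for your platform, and note the token must first be generated from your Kaggle account page):

```shell
# Install the Kaggle CLI (assumes Python and pip are available)
pip install kaggle

# Place your API token (kaggle.json, downloaded from kaggle.com -> Account)
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# List the competition files, then download them
kaggle competitions files home-credit-default-risk
kaggle competitions download -c home-credit-default-risk
```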

Team and project meta information

Project Title - Home Credit Default Risk

Group Number - 37

Team Members - Shriya Reddy Pulagam - spulagam@iu.edu

            - Srinivas Yashvanth Valaval - svalaval@iu.edu

            - Anoop Bulusu - srbulusu@iu.edu

            - Nanda Kishore Vallamkondu - nvallamk@iu.edu

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.

The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).

Data files overview

There are 7 different sources of data:

Data Dictionary

As part of the data download comes a Data Dictionary. It is named HomeCredit_columns_description.csv.

[Screenshot: preview of HomeCredit_columns_description.csv]

Application train

Application test

The Other datasets

Exploratory Data Analysis

Summary of Application train

Splitting categorical and numerical features
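A common way to split features by dtype in pandas is `select_dtypes`. A minimal sketch on a toy frame (the column names are real HCDR columns, but the values are illustrative):

```python
import pandas as pd

# Toy stand-in for application_train; values are made up
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500.0, 270000.0, 67500.0],
    "CNT_CHILDREN": [0, 1, 2],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "CODE_GENDER": ["M", "F", "M"],
})

# Numeric columns (int64/float64) vs. categorical (object-dtype) columns
numerical_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()

print(numerical_cols)
print(categorical_cols)
```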

Splitting Application train data

Missing data for application train
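The per-column missing-data summary used throughout this report can be computed as below (a sketch on a two-column toy frame; the real table has 122 columns):

```python
import pandas as pd
import numpy as np

# Toy frame with missing values
df = pd.DataFrame({
    "OWN_CAR_AGE": [12.0, np.nan, np.nan, 5.0],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0, 312682.5],
})

# Percentage of missing values per column, sorted descending
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).sort_values(ascending=False)
print(missing_pct)
```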

Distribution of the target column

Correlation with the target column
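Correlating each numeric feature with TARGET is a one-liner in pandas. A sketch with toy values (DAYS_BIRTH and EXT_SOURCE_2 are real HCDR columns, but the data here is fabricated for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "TARGET": [0, 0, 1, 1, 0, 1],
    "DAYS_BIRTH": [-12000, -15000, -9000, -8500, -16000, -9500],
    "EXT_SOURCE_2": [0.8, 0.7, 0.3, 0.2, 0.9, 0.25],
})

# Pearson correlation of every numeric feature with TARGET
corr = df.corr()["TARGET"].drop("TARGET").sort_values()
print(corr)
```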

Applicants Age

Applicants occupations

Visual Exploratory Data Analysis

Distribution of application train dataset

Distribution of positively correlated features

Distribution of negatively correlated features

Plots comparing target variables with input features.

Observation: From the above graph, we can see that the 30-40 age group contains the largest number of at-risk applicants.

Summary of bureau

Summary of bureau_balance

Missing data for bureau_balance

Observation: As we can observe, bureau_balance has no missing data.

Summary of Previous_application

Missing data for previous_application

Summary of POS_CASH_BALANCE

Missing data for POS_CASH_balance

Summary of credit_card_balance

Missing data for credit_card_balance

Summary of installments_payments

Missing data for installments_payments

Dataset questions

Unique record for each SK_ID_CURR
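A quick sanity check that each application row carries a unique SK_ID_CURR (sketched on hypothetical IDs):

```python
import pandas as pd

# Hypothetical slice of application data
df = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002, 100003],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0],
})

# True when no SK_ID_CURR value repeats
is_unique = df["SK_ID_CURR"].is_unique
print(is_unique)
```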

previous applications for the submission file

The applicants in the Kaggle submission file have had previous applications recorded in previous_application.csv: 47,800 out of 48,744 people have had previous applications.
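The coverage count can be obtained with an `isin` membership check. A sketch on hypothetical ID sets standing in for the submission file and previous_application.csv:

```python
import pandas as pd

# Hypothetical IDs; the real files hold 48,744 test IDs
test_ids = pd.Series([100001, 100005, 100013, 100028])
prev_app_ids = pd.Series([100001, 100005, 100013, 100042])

# Which test IDs appear at least once in previous_application?
has_prev = test_ids.isin(prev_app_ids)
print(has_prev.sum(), "of", len(test_ids), "test IDs have previous applications")
```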

Histogram of Number of previous applications for an ID

Can we differentiate applicants by low, medium, and high numbers of previous applications?
* Low = fewer than 5 applications (22%)
* Medium = 10 to 39 applications (58%)
* High = 40 or more applications (20%)

Feature Engineering and Selection

Feature engineering on bureau_balance

Feature engineering on POS_CASH_balance

Feature engineering on credit_card_balance

Feature engineering on installments_payments

Merging tertiary datasets with secondary datasets

Domain based features

Merging bureau_balance with bureau

Domain knowledge based features

Merging secondary with primary datasets

Filling missing values with 0
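After the merges, aggregate columns come through as NaN for applicants with no rows in a secondary table, so NaN here genuinely means "no history" and 0 is a defensible fill. A sketch (the column name is hypothetical):

```python
import pandas as pd
import numpy as np

merged = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002],
    "PREV_APP_COUNT": [3.0, np.nan],  # NaN: no rows matched in the merge
})

# Replace "no matching history" NaNs with 0
merged["PREV_APP_COUNT"] = merged["PREV_APP_COUNT"].fillna(0)
print(merged["PREV_APP_COUNT"].tolist())
```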

Finding correlations for feature selection
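One simple selection rule is to keep features whose absolute Pearson correlation with TARGET clears a threshold. A deterministic sketch (the threshold 0.1 and feature names are illustrative, not the project's actual cutoff):

```python
import pandas as pd
import numpy as np

# Deterministic toy data: "useful" tracks TARGET, "noise" is orthogonal to it
target = np.repeat([0, 1], 50)
df = pd.DataFrame({
    "TARGET": target,
    "useful": target + np.tile([0.1, -0.1], 50),
    "noise": np.tile([1.0, -1.0], 50),
})

# Keep features whose |corr| with TARGET exceeds the threshold
corr = df.corr()["TARGET"].drop("TARGET").abs()
selected = corr[corr > 0.1].index.tolist()
print(selected)  # → ['useful']
```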

Modeling pipelines

Logistic Regression
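A minimal sketch of the kind of sklearn pipeline used here: scale the features, fit logistic regression, score by AUROC. The data is synthetic; the project's actual preprocessing steps are not reproduced:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the engineered feature matrix
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),     # scale before the linear model
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
auc = roc_auc_score(y, pipe.predict_proba(X)[:, 1])
print(f"train AUROC = {auc:.3f}")
```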

Random Forest with hyperparameter tuning
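Hyperparameter tuning for the tree models follows the usual GridSearchCV pattern. A sketch with a deliberately tiny, hypothetical grid (the project's actual grid is not shown here; XGBoost's sklearn wrapper plugs into the same API):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, easily separable data
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] > 0).astype(int)

# Hypothetical small grid; real grids would cover more values
grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```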

XGBoost with hyperparameter tuning

Multilayer Perceptron

AUROC plotting
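The AUROC values and curve coordinates come from sklearn; a sketch on toy predictions (plotting itself would just be `plt.plot(fpr, tpr)` with matplotlib):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

auc = roc_auc_score(y_true, y_prob)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print(f"AUROC = {auc:.3f}")
```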

TensorBoard visualizations

[Screenshots: TensorBoard training and validation loss curves]

Submission File Prep

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
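Building a file in that format from a fitted model's probabilities is straightforward with pandas (IDs and probabilities below are the illustrative ones from the format example; in practice they come from `predict_proba` on the test set):

```python
import pandas as pd

# Assemble the two required columns
submission = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "TARGET": [0.1, 0.9, 0.2],
})

# index=False keeps the file to exactly the header plus one row per ID;
# to_csv with no path returns the CSV text (pass "submission.csv" to write a file)
csv_text = submission.to_csv(index=False)
print(csv_text)
```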

Kaggle test preprocessing

Kaggle submission via the command line API
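With the CLI configured, the submission itself is one command (the submission message is illustrative):

```shell
# Submit the prediction file; -c names the competition, -m is a free-text message
kaggle competitions submit -c home-credit-default-risk \
  -f submission.csv -m "XGBoost with engineered features"

# Check the resulting scores
kaggle competitions submissions -c home-credit-default-risk
```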

[Screenshot: Kaggle command-line submission output]

Report submission

Click on this link

Write-up

For this phase of the project, you will need to submit a write-up summarizing the work you did. The write-up form is available on Canvas (Modules-> Module 12.1 - Course Project - Home Credit Default Risk (HCDR)-> FP Phase 2 (HCDR) : write-up form ). It has the following sections:

Abstract

For the Home Credit Default Risk project, we extracted the data from the Kaggle competition 'Home Credit Default Risk' and performed exploratory data analysis to understand and explore the data. Visualizations relating most of the input features to the 'Target' variable were produced to find the people at maximum risk. In the second phase of the project, the models we built were overfitting, so we reworked the feature engineering on all the tables and added features such as AMT_DRAWING_ratio, DAYS_INSTALLMENTS_diff, and AMT_ANNUITY_ratio. Data leakage was guarded against throughout the project. We also built a neural network, a multilayer perceptron (MLP), using PyTorch, and used TensorBoard to visualize and track the training and validation loss of the models. The classical machine learning classifiers (Logistic Regression, Random Forest, and XGBoost) performed well, and the accuracies of our models improved compared to phase 2. Logistic Regression achieved a test accuracy of 91.1% and a test ROC AUC of 0.6855, whereas XGBoost achieved a better ROC AUC of 0.7207. The MLP achieved a test accuracy of 68% and a ROC AUC of 0.705. A Kaggle submission was made using the best-performing model, XGBoost, and obtained a score of 0.7214.

Introduction

Data Description

In this project, seven different datasets are utilized.

● Application_{train|test}.csv:

This is the primary dataset. It consists of 307,511 observations and 122 variables providing data on all the applicants. Each applicant has one row, with a Target variable of 0 or 1: 0 indicates that the loan was repaid and 1 indicates that it was not. All the data points in this dataset are static for all applicants.

● Bureau.csv:

Here, the data consists of the customer's previous credits from other financial institutions that were reported to the Credit Bureau. Every previous credit has its own row; a single loan in the application data can have multiple previous credits.

● Bureau_Balance.csv:

This data has monthly balances of previous credits. A single previous credit can have multiple rows, one for every month of the credit's length.

● Credit_card_balance.csv:

All the monthly balances of previous credit cards the applicant has had with Home Credit. A single credit card can have multiple rows, each row holding one month's balance.

● Previous_application.csv:

Data consists of each person's previous loan applications at Home Credit. A current loan applicant can have multiple previous loan applications.

● POS_CASH_balance.csv:

This dataset holds monthly data on previous point-of-sale and cash loans clients had with Home Credit. A single previous loan can have multiple rows, each row representing one month of that loan.

● Installments_payments.csv:

Payment history for previous loans at Home Credit, with information on every payment made or missed.

Tasks to be handled

-> Handle overfitting

-> Rework feature engineering and feature selection

-> Compute correlations on numerical features

-> Implement a multilayer perceptron using PyTorch

-> Use TensorBoard to visualize the training results

Pipelines

Families of input features:

* Numerical features: 182 (int64, float64)
* Categorical features: 13 (object)

The total number of input features: 195.

We have trained four models:

  1. Logistic Classifier
  2. RandomForestClassifier
  3. XGBoost
  4. Multilayer Perceptron

[Image: modeling pipeline diagram]

Data Leakage

Cardinal sins avoided:

Log function

[Image: log loss formula]

Gini log loss

[Image: Gini / log loss formula]
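Assuming the formula images in this section show the standard definitions, the binary log loss over N predicted probabilities $\hat{p}_i$ for labels $y_i$, and the usual Gini-to-AUC relationship, are:

```latex
\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\Big]
\qquad
\mathrm{Gini} = 2\,\mathrm{AUC} - 1
```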

Experimental results

Discussion of Results

In the last phase, the models were overfitting and reported an accuracy of 1.0. We improved this, and the models now perform better than in phase 2.

Logistic Regression scores:

Random Forest scores:

XGBoost scores:

In the final phase, we also implemented the MLP model using PyTorch and obtained the following scores:

Comparatively, XGBoost performed best among all the models. A Kaggle submission was made with XGBoost and scored 0.72147.

Conclusion

This project's main goal is to develop a machine learning model that can predict whether or not a loan applicant will be able to repay the loan. Without statistical support, many deserving applicants with no credit history or default history get rejected. The HCDR dataset is used to train the machine learning model, which, based on the history of comparable applicants in the past, can estimate whether or not an applicant will be able to repay their loan. This model would help screen applications by providing statistical support drawn from the numerous aspects taken into account.

In the previous phase of the project, the models were overfitting the data and hence got an accuracy score of 1.0. To overcome this, we added new features using the pipelines, and tables were merged considering their inherent hierarchy. A neural network model, a multilayer perceptron, was implemented using PyTorch in the final phase; the MLP was modeled with two hidden layers, eight neurons, and 166 input dimensions. This resulted in an accuracy of 68% with an AUC score of 0.705. Out of all the models implemented, XGBoost gave the best results, with an accuracy of 92.2% and an AUC score of 0.720. We also made a Kaggle submission and obtained a score of 0.7214.

Kaggle Submission

[Screenshot: Kaggle submission score]

References

Some of the material in this notebook has been adapted from here.

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools

Read the following: